Data Analyses using SQL

Data Analysis
SQL
Docker
Author

Waseem Waheed

Published

May 28, 2024

1 Introduction

In an effort to showcase my SQL skills as well as challenge myself to improve, I picked two challenges from the 8weeksqlchallenge and solved them. Each challenge involves a relational data from a fictitious business/company.

The purpose of this post is to document my effort in solving two (Section 2 & Section 3) challenge data analyses using SQL.

1.1 Analysis methodology

  1. Explore: explore the data schema, tables, views and attributes.

  2. Profile: check for duplicates in the data and distributions/frequencies.

  3. Clean: clean data which involves fixing typos, type casting, treating null values, and adding categorisations.

  4. Shape/Transform: reshape the data in the format needed for analysis, operations like pivot, unpivot, aggregate, transpose, append, etc.

  5. The final step is to anlyse the data, spotting trends, patterns, conclusions, etc.

1.2 Query writing methodology

The first step to answer a question starts with identifying relevant tables and columns. This can done by consulting the table specification in a documentation, by running a SELECT statement or by doing both.

Once relevant tables and attributes are identified, the SQL queries are developed iteratively and incrementally. Always starting with a basic SELECT and building the query incrementally.

The final step is to perform a sanity checks by limiting the data a small the number of rows to ensure that the calculation are correct.

2 Challenge 1: Danny’s Diner

2.1 Overview

The data for this challenge is sales records from a fictitious restaurant. The dataset has three tables:

  • sales
  • memebers
  • menu

and they are related as illustrated in the following Entity Relational Diagram: ERD

Objective of the challenge is to mine the data for petential areas of business/service improvement that can increase revenue.

2.2 Details

I used Docker to make the analysis easily reproducible by loading two servers:

  • MySQL - opensource database server
  • CloudBeaver - opensource web-based database editor.

Data is loaded into MySQL at instantiation time automatically.

2.3 Questions and SQL queries

  1. What is the total amount each customer spent at the restaurant?

  2. How many days has each customer visited the restaurant?

  3. What was the first item from the menu purchased by each customer?

  4. What is the most purchased item on the menu and how many times was it purchased by all customers?

  5. Which item was the most popular for each customer?

  6. Which item was purchased first by the customer after they became a member?

  7. Which item was purchased just before the customer became a member?

  8. What is the total items and amount spent for each member before they became a member?

  9. If each $1 spent equates to 10 points and sushi has a 2x points multiplier - how many points would each customer have?

  10. In the first week after a customer joins the program (including their join date) they earn 2x points on all items, not just sushi - how many points do customer A and B have at the end of January?

For a detailed description of the challenge check the link.

2.4 Used SQL concepts

A number of SQL concepts were used to answer the questions, often in combinations. Things like * Joins: LEFT JOIN, RIGHT JOIN, INNER JOIN * Keywords: DISTINCT * Functions: GROUP_CONCAT() , DATEDIFF(), FIRST_VALUE() * Common Table Expressions (CTE) * Expressions: CASE

2.5 Conclusions

SQL syntax simplifies the data manipulation and analysis signficantly. Having an understanding of the main concepts made learning the syntax and googling for a solution when stuck much easier.

3 Challenge 2: Pizza Runner

3.1 Overview

The data for this challenge is more challenging as it requires some cleaning as you shall see. It is customer orders and delivery records from a fictitious Pizza restaurant. The dataset has six tables:

  • runner_orders
  • runners
  • customer_orders
  • pizza_names
  • pizza_recipes
  • pizza_toppings

and they are related as illustrated in the following Entity Relational Diagram: ERD

Objective is also to mine the data for petential areas of business/service improvement that can potentially increase revenue.

3.2 Details

I used Docker to make the analysis easily reproducible by loading two servers:

  • MySQL - opensource database server
  • CloudBeaver - opensource web-based database editor.

Data is loaded into MySQL at instantiation time automatically.

3.3 Data cleaning

Initial investigations of the tables revealed that the data required some cleaning to make it suitable for the required analyses.

I have created three new views to that contain cleaned up versions of three tables (runner_orders, customer_orders, pizza_recipes):

3.4 Questions and SQL queries

3.4.1 Pizza Metrics

3.4.2 Runner and Customer Experience

3.4.3 Ingredient Optimization

3.4.4 Pricing and Ratings

3.4.5 Bonus question

For a detailed description of the challenge check the link.

3.5 Used SQL concepts

A number of SQL concepts were used to answer the questions, often in combinations. Things like * Concept: Subqueries, CTE, Views, Union * Joins: LEFT * Keywords: DISTINCT * Functions: IF(), HOUR, WEEKDAY(),DAYNAME(), DAYOFYEAR, TIMEDIFF(), DATE_ADD * Expressions: CASE, ALTER TABLE, DROP TABLE, CREATE TABLE, INSER INTO, UPDATE

3.6 Conclusions

SQL is a joy to work with for especially when the data is relational and whenever the volume of data big that it doesn’t fit in local memory. Having said, I find it still compelling to use SQL even with the data can fit in local memory.